library(tidyverse)
library(psych)
library(lavaan)
library(lme4)
library(data.table)
library(readxl)
library(magrittr)
library(haven)
library(stargazer)
library(GGally)
library(car)
library(kableExtra)
library(yarrr)
library(lsmeans)
library(broom)
library(multcomp)
library(lsr)
library(apaTables)
library(MBESS)

Checking your R version and Updating

Every once and a while R releases a new version of the software. This will not download automagically, so you should check the r website https://www.r-project.org/ from time to time, or follow them on twitter. You can see what version you are currently on by entering version into the console.

Installing and Loading Packages

#install.packages()
#require()
#library()

Function Troubleshooting

Calling ?function allows you to look at the arguments for a function as well as their defaults. This is really helpful for when using new functions, knowing what abilities a function has, or for troubleshooting.

?read.csv

Reading in Data

CSV files

There are three main ways to read in csv files. First, you can use base R with the function read.csv. Alternatively you can use the read_csv function from the readr package. The final way is with the fread function from the data.table package.

# Base R
iris.1<- read.csv("./data/iris.csv",row.names=NULL)
# Readr
iris.2<- read_csv("./data/iris.csv")
# data.table
iris.3<- fread("./data/iris.csv")

Excel files

For excel files, you can use the readxl package. Note: if you have multiple sheets you will need to specify which sheet you want to read in (i.e., read_excel("<filename>",sheet = "<sheetname>")

readxl_example("deaths.xlsx")

## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/deaths.xlsx"

Deaths<-read_excel(readxl_example("deaths.xlsx"))

Data from a website

In this example we are downloading movie data from GitHub. Note that this isn’t web scraping, this is merely accessing a previously compiled dataset, which in this case is in csv format. Reading in data from a website is helpful if you are accessing archival datasets etc. We won’t get into APIs just yet but this is the first step in that direction.

movies <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv",header=TRUE)

head(movies)

Delimited Text Files

Sometimes data is saved in a format where values are separated by tabs or some other delimiter. We can use read.table which allows us to specify how the file is delimited.

insect<- read.table(file = 'http://www.ucd.ie/ecomodel/Resources/INSECT.TXT',
                     sep = '\t',
                     header = TRUE,
                     skip=3)

SPSS files

For SPSS files I prefer to use the haven package. Another alternative is the foreign package, but it is a little less consistent than I would prefer.

survey<- read_sav("~/Downloads/survey.sav")

Exploring a Dataset

Getting a sense of the structure and characteristics of a dataset is an important step. There are a few different ways we can examine a dataset.

We can print out the first few rows with head, or the last few rows with tail. It defaults to 10 rows, but you can adjust that by passing a number to the second argument.

head(iris.1)

tail(iris.1)

head(iris.1,20)

We can use describe from the psych package.

psych::describe(iris.1)

We can print a descriptives table and get a headstart on putting together our final documentation with stargazer.

We can use summary from base r.

summary(iris.1)

##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500

Basic data manipulation

The Tidyverse

The Tidyverse is a collection of packages for reading, manipulating, analyzing, visualizing, and reporting in R. It has gained widespread popularity and rivals base R. Many people either strictly adhere to tidyverse packages, or utilize base R. For this reason, you may see people do the same thing in different ways, and they may recommend different solutions based on their preferred packages. Ultimately this points out one of the advantages of R which is that there are no right answers.

I think that tidyverse is untuitive, consistent across packages, flexible, and in many instances better performing (i.e., speed) than base R. Even if you choose not to use Tidyverse packages I think some of the functions are helpful to know.

Magrittr

Lets start with the basic pipe operator %>%. The pipe operator allows you to run multiple functions on a dataset in succession while cutting down on the number of lines of code and cleaning up your codes appearance. I like to think of the pipe as meaning “and then”.

Here we are calling the iris.1 data set and then grouping that dataset by Species, and then summarising those groups by calculating the mean. Because I do not assign this to a variable the output prints to the console without changing the data. One thing you will notice is that the pipe passes the data at the beginning to the next argument, so I am not specifying a dataset for the two functions, because it knows i’m referring to iris.1. Also remember that the functions are applied consecutively. So its like the results of each row are stored in a temporary dataset and you performing the next set of functions on that new dataset (not the original one).

iris.1 %>%
  group_by(Species) %>%
  summarise(mean = mean(Petal.Length))

With the %<>% assignment pipe, we can save the results to an object. This is equivalent to iris.1<- iris.1. Here

iris.1 %<>% 
  group_by(Species) %>%
  summarise(mean = mean(Petal.Length))

Dplyr

Dplyr is a package containing functions that allow us to manipulate data. Most if not all of the Dplyr verbs have a selection of scoped variants including _if, _at, and _all. Each function has some of its own unique scoped variants as well.

Filter

Filter allows us to.. well.. filter our data based on a formula that we define. You can use one criterion, you can specify that it filters on one criterion or another filter(Sepal.Length >= 5 | Sepal.Width < 3), or you can apply two filter conditions filter(Sepal.Length >= 5 & Sepal.Width < 3).

iris.2 %>%
  filter(Sepal.Length >= 5)

Select

The select function allows us to remove columns, or pick out columns for a new dataset.

You can either pass a list of column names (no quotes necessary), you can use : and specify the first and last column of a range, or you can negative subset to remove columns.

Select comes with some unique modifiers which are useful for dealing with named columns. This includes Starts_with,Ends_with, and contains.

demographics<-survey %>%
  dplyr::select(id:educ)

demographics<- survey %>%
 dplyr::select(-source)

iris %>%
  dplyr::select(starts_with("Sepal"))

Select is also valuable for reordering columns

iris %>%
  dplyr::select(Sepal.Length,everything())

Mutate

The mutate function allows us to compute new variables. By default mutate iterates over columns. You can make it iterate over rows with rowwise(), however this is no longer supported by people such as Hadley Wickham, who suggest using purrr instead.

survey %>%
  mutate(year = 2015) %>%
  dplyr::select(id,year)

Transmute

Transmute is similar to mutate but it only keeps the new variables by default.

survey %>%
  rowwise() %>%
  transmute(op = mean(c(op1,op2,op3,op4,op5,op6,na.rm = TRUE)))

Purrr

This is another package in the Tidyverse,but it does not focus on data manipulation. Rather, purrr is an invaluable resource for applying functions iteratvely across lists,nested datasets. I’m going to keep fleshing this section out so stay tuned.

iris %>% 
  mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))

Getting Reliability Estimates

reliability.data<- read_tsv("./data/data.csv")

Extraversion<- dplyr::select(reliability.data,starts_with("E"),-engnat)
Neuroticism<- dplyr::select(reliability.data,starts_with("N"))
Agreeableness<- dplyr::select(reliability.data,starts_with("A"),-age)
Openness<- dplyr::select(reliability.data,starts_with("O"))
Conscientiousness<- dplyr::select(reliability.data,starts_with("C"),-country)

measures<-list(Extraversion,Neuroticism,Agreeableness,Openness,Conscientiousness)
map(measures,psych::alpha)

## Some items ( E2 E4 E6 E8 E10 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( N2 N4 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( A1 A3 A5 A7 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( O2 O4 O6 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' optionSome items ( C2 C4 C6 C8 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option

## [[1]]
## 
## Reliability analysis   
## Call: .f(x = .x[[i]])
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N   ase mean   sd median_r
##      -0.39     -0.39    0.26    -0.029 -0.28 0.016  3.1 0.35    -0.34
## 
##  lower alpha upper     95% confidence boundaries
## -0.42 -0.39 -0.36 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r med.r
## E1      -0.49     -0.44    0.23    -0.035 -0.31    0.017  0.22 -0.34
## E2      -0.28     -0.32    0.28    -0.028 -0.24    0.014  0.21 -0.34
## E3      -0.48     -0.44    0.22    -0.035 -0.31    0.017  0.21 -0.34
## E4      -0.24     -0.26    0.31    -0.023 -0.21    0.014  0.21 -0.34
## E5      -0.36     -0.33    0.26    -0.028 -0.25    0.016  0.20 -0.34
## E6      -0.35     -0.38    0.26    -0.031 -0.27    0.015  0.23 -0.34
## E7      -0.46     -0.44    0.20    -0.035 -0.31    0.017  0.20 -0.34
## E8      -0.27     -0.30    0.31    -0.026 -0.23    0.014  0.23 -0.36
## E9      -0.34     -0.32    0.28    -0.028 -0.24    0.016  0.23 -0.34
## E10     -0.19     -0.22    0.34    -0.021 -0.18    0.013  0.22 -0.34
## 
##  Item statistics 
##         n raw.r std.r   r.cor r.drop mean  sd
## E1  19719  0.38  0.36  0.2892  0.030  2.6 1.2
## E2  19719  0.22  0.25  0.1153 -0.152  2.8 1.3
## E3  19719  0.37  0.35  0.3026  0.024  3.4 1.2
## E4  19719  0.15  0.20  0.0047 -0.192  3.2 1.2
## E5  19719  0.29  0.26  0.1814 -0.079  3.4 1.3
## E6  19719  0.27  0.30  0.1578 -0.087  2.5 1.2
## E7  19719  0.40  0.36  0.3807 -0.008  2.9 1.4
## E8  19719  0.20  0.23 -0.0179 -0.163  3.4 1.3
## E9  19719  0.30  0.26  0.0656 -0.100  3.1 1.4
## E10 19719  0.13  0.16 -0.1046 -0.236  3.6 1.3
## 
## Non missing response frequency for each item
##     0    1    2    3    4    5 miss
## E1  0 0.24 0.23 0.28 0.18 0.07    0
## E2  0 0.21 0.24 0.24 0.18 0.13    0
## E3  0 0.08 0.17 0.24 0.28 0.23    0
## E4  0 0.10 0.21 0.28 0.25 0.16    0
## E5  0 0.09 0.16 0.21 0.28 0.25    0
## E6  0 0.26 0.32 0.19 0.15 0.08    0
## E7  0 0.23 0.21 0.18 0.19 0.18    0
## E8  0 0.09 0.19 0.23 0.26 0.24    0
## E9  0 0.17 0.20 0.19 0.23 0.21    0
## E10 0 0.08 0.16 0.19 0.25 0.33    0
## 
## [[2]]
## 
## Reliability analysis   
## Call: .f(x = .x[[i]])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.71      0.69    0.79      0.19 2.3 0.0026  3.1 0.67     0.38
## 
##  lower alpha upper     95% confidence boundaries
## 0.71 0.71 0.72 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## N1       0.66      0.63    0.74      0.16 1.7   0.0031 0.142  0.36
## N2       0.80      0.80    0.84      0.30 3.9   0.0019 0.095  0.41
## N3       0.68      0.65    0.76      0.17 1.9   0.0030 0.150  0.39
## N4       0.79      0.78    0.83      0.28 3.5   0.0019 0.125  0.41
## N5       0.67      0.64    0.76      0.17 1.8   0.0030 0.159  0.39
## N6       0.64      0.61    0.73      0.15 1.6   0.0033 0.139  0.36
## N7       0.64      0.62    0.72      0.15 1.6   0.0033 0.142  0.37
## N8       0.63      0.61    0.71      0.15 1.6   0.0034 0.138  0.36
## N9       0.64      0.62    0.74      0.15 1.6   0.0033 0.145  0.36
## N10      0.67      0.65    0.75      0.17 1.8   0.0030 0.145  0.36
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## N1  19719  0.68  0.68  0.65   0.56  3.3 1.3
## N2  19719 -0.32 -0.30 -0.50  -0.46  3.2 1.2
## N3  19719  0.60  0.61  0.55   0.48  3.8 1.1
## N4  19719 -0.14 -0.13 -0.32  -0.31  2.8 1.2
## N5  19719  0.64  0.64  0.57   0.51  3.0 1.3
## N6  19719  0.77  0.77  0.76   0.67  3.0 1.3
## N7  19719  0.76  0.75  0.76   0.65  3.2 1.3
## N8  19719  0.78  0.77  0.79   0.68  2.8 1.4
## N9  19719  0.74  0.74  0.72   0.63  3.1 1.3
## N10 19719  0.64  0.63  0.59   0.50  2.8 1.3
## 
## Non missing response frequency for each item
##     0    1    2    3    4    5 miss
## N1  0 0.11 0.20 0.22 0.25 0.22    0
## N2  0 0.08 0.20 0.28 0.28 0.16    0
## N3  0 0.04 0.11 0.16 0.35 0.34    0
## N4  0 0.17 0.27 0.27 0.18 0.10    0
## N5  0 0.15 0.25 0.23 0.24 0.13    0
## N6  0 0.16 0.24 0.22 0.22 0.16    0
## N7  0 0.12 0.22 0.22 0.25 0.19    0
## N8  0 0.22 0.24 0.20 0.20 0.14    0
## N9  0 0.13 0.22 0.21 0.27 0.17    0
## N10 0 0.19 0.25 0.23 0.20 0.13    0
## 
## [[3]]
## 
## Reliability analysis   
## Call: .f(x = .x[[i]])
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N   ase mean   sd median_r
##     -0.024     0.053    0.38    0.0055 0.055 0.011  3.2 0.35    -0.15
## 
##  lower alpha upper     95% confidence boundaries
## -0.05 -0.02 0 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r    S/N alpha se var.r  med.r
## A1      0.103     0.168    0.46    0.0219  0.202   0.0093  0.15  0.044
## A2     -0.073    -0.024    0.32   -0.0026 -0.024   0.0120  0.13 -0.205
## A3      0.123     0.199    0.48    0.0269  0.249   0.0093  0.15  0.064
## A4     -0.176    -0.144    0.22   -0.0142 -0.126   0.0131  0.11 -0.174
## A5      0.185     0.274    0.49    0.0403  0.378   0.0085  0.12  0.032
## A6     -0.203    -0.147    0.26   -0.0144 -0.128   0.0134  0.14 -0.174
## A7      0.187     0.278    0.48    0.0410  0.385   0.0085  0.12  0.032
## A8     -0.161    -0.119    0.28   -0.0119 -0.106   0.0129  0.13 -0.180
## A9     -0.247    -0.207    0.19   -0.0194 -0.171   0.0139  0.12 -0.174
## A10    -0.181    -0.124    0.29   -0.0124 -0.110   0.0132  0.14 -0.205
## 
##  Item statistics 
##         n    raw.r  std.r r.cor r.drop mean  sd
## A1  19719  0.24410  0.142 -0.18 -0.145  2.3 1.4
## A2  19719  0.36191  0.415  0.36  0.062  3.9 1.1
## A3  19719  0.14985  0.086 -0.29 -0.192  2.2 1.2
## A4  19719  0.47230  0.543  0.65  0.197  4.0 1.0
## A5  19719  0.00741 -0.063 -0.39 -0.300  2.2 1.1
## A6  19719  0.50483  0.546  0.54  0.210  3.9 1.1
## A7  19719 -0.00089 -0.071 -0.37 -0.305  2.2 1.1
## A8  19719  0.45551  0.518  0.50  0.180  3.8 1.0
## A9  19719  0.54348  0.601  0.72  0.272  3.9 1.1
## A10 19719  0.47812  0.523  0.45  0.202  3.7 1.1
## 
## Non missing response frequency for each item
##     0    1    2    3    4    5 miss
## A1  0 0.39 0.25 0.13 0.14 0.10    0
## A2  0 0.03 0.08 0.17 0.34 0.37    0
## A3  0 0.40 0.26 0.17 0.13 0.05    0
## A4  0 0.03 0.07 0.14 0.36 0.40    0
## A5  0 0.34 0.35 0.16 0.10 0.05    0
## A6  0 0.04 0.09 0.18 0.31 0.38    0
## A7  0 0.34 0.35 0.17 0.10 0.05    0
## A8  0 0.03 0.09 0.22 0.39 0.26    0
## A9  0 0.04 0.08 0.14 0.37 0.37    0
## A10 0 0.03 0.09 0.29 0.34 0.25    0
## 
## [[4]]
## 
## Reliability analysis   
## Call: .f(x = .x[[i]])
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N    ase mean   sd median_r
##       0.27       0.3    0.53     0.041 0.42 0.0075  3.3 0.38     0.14
## 
##  lower alpha upper     95% confidence boundaries
## 0.26 0.27 0.29 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r    med.r
## O1      0.075      0.12    0.39     0.016 0.14   0.0099 0.090 -0.00089
## O2      0.429      0.44    0.59     0.082 0.80   0.0057 0.086  0.19340
## O3      0.223      0.24    0.48     0.034 0.32   0.0083 0.095 -0.00089
## O4      0.381      0.41    0.58     0.071 0.68   0.0061 0.094  0.19340
## O5      0.133      0.14    0.41     0.017 0.16   0.0092 0.090 -0.00089
## O6      0.438      0.47    0.61     0.089 0.87   0.0058 0.088  0.19340
## O7      0.183      0.20    0.47     0.027 0.25   0.0087 0.099  0.01249
## O8      0.075      0.14    0.40     0.017 0.16   0.0100 0.095 -0.00089
## O9      0.218      0.24    0.52     0.035 0.32   0.0083 0.111  0.00937
## O10     0.144      0.15    0.41     0.019 0.18   0.0091 0.083 -0.00089
## 
##  Item statistics 
##         n  raw.r  std.r r.cor r.drop mean   sd
## O1  19719  0.627  0.613  0.64   0.40  3.7 1.12
## O2  19719  0.011 -0.031 -0.27  -0.27  2.1 1.14
## O3  19719  0.403  0.430  0.35   0.15  4.1 1.01
## O4  19719  0.114  0.077 -0.15  -0.17  2.1 1.11
## O5  19719  0.552  0.597  0.61   0.35  3.9 0.94
## O6  19719 -0.064 -0.097 -0.37  -0.32  1.8 1.07
## O7  19719  0.465  0.505  0.42   0.25  4.1 0.92
## O8  19719  0.629  0.595  0.60   0.36  3.2 1.26
## O9  19719  0.409  0.429  0.25   0.17  4.1 0.98
## O10 19719  0.535  0.577  0.60   0.31  4.0 0.98
## 
## Non missing response frequency for each item
##     0    1    2    3    4    5 miss
## O1  0 0.05 0.10 0.24 0.33 0.28    0
## O2  0 0.36 0.31 0.19 0.09 0.04    0
## O3  0 0.02 0.06 0.15 0.31 0.46    0
## O4  0 0.39 0.29 0.21 0.07 0.04    0
## O5  0 0.02 0.05 0.25 0.40 0.28    0
## O6  0 0.53 0.27 0.11 0.06 0.04    0
## O7  0 0.01 0.05 0.17 0.40 0.38    0
## O8  0 0.12 0.18 0.25 0.27 0.18    0
## O9  0 0.02 0.05 0.14 0.34 0.44    0
## O10 0 0.02 0.06 0.20 0.34 0.38    0
## 
## [[5]]
## 
## Reliability analysis   
## Call: .f(x = .x[[i]])
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N  ase mean   sd median_r
##      0.056      0.11    0.36     0.012 0.12 0.01  3.2 0.39    -0.16
## 
##  lower alpha upper     95% confidence boundaries
## 0.03 0.06 0.08 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r     S/N alpha se var.r  med.r
## C1     -0.017    0.0077    0.28   0.00086  0.0078   0.0114 0.096 -0.163
## C2      0.105    0.1796    0.39   0.02374  0.2189   0.0097 0.100  0.016
## C3     -0.033   -0.0102    0.29  -0.00112 -0.0101   0.0116 0.111 -0.203
## C4      0.120    0.2053    0.41   0.02790  0.2584   0.0096 0.095  0.028
## C5      0.086    0.1008    0.34   0.01230  0.1121   0.0102 0.094 -0.163
## C6      0.140    0.2044    0.40   0.02775  0.2569   0.0092 0.093  0.028
## C7     -0.015    0.0129    0.30   0.00145  0.0130   0.0114 0.104 -0.163
## C8      0.162    0.2457    0.44   0.03493  0.3257   0.0093 0.102  0.028
## C9     -0.037   -0.0092    0.27  -0.00102 -0.0091   0.0115 0.096 -0.163
## C10    -0.056   -0.0343    0.27  -0.00370 -0.0332   0.0118 0.106 -0.180
## 
##  Item statistics 
##         n raw.r std.r  r.cor r.drop mean  sd
## C1  19719  0.39 0.457  0.419  0.120  3.3 1.1
## C2  19719  0.30 0.210  0.009 -0.052  3.0 1.4
## C3  19719  0.40 0.478  0.384  0.156  4.0 1.0
## C4  19719  0.24 0.165 -0.058 -0.083  2.7 1.2
## C5  19719  0.29 0.333  0.222 -0.032  2.7 1.2
## C6  19719  0.27 0.166 -0.032 -0.096  2.9 1.4
## C7  19719  0.40 0.451  0.368  0.112  3.6 1.2
## C8  19719  0.13 0.089 -0.228 -0.162  2.5 1.1
## C9  19719  0.44 0.477  0.461  0.132  3.2 1.2
## C10 19719  0.44 0.506  0.456  0.192  3.6 1.0
## 
## Non missing response frequency for each item
##     0    1    2    3    4    5 miss
## C1  0 0.06 0.17 0.29 0.34 0.14    0
## C2  0 0.19 0.21 0.19 0.24 0.16    0
## C3  0 0.02 0.07 0.17 0.37 0.36    0
## C4  0 0.21 0.29 0.24 0.18 0.09    0
## C5  0 0.21 0.26 0.26 0.18 0.10    0
## C6  0 0.21 0.23 0.17 0.22 0.17    0
## C7  0 0.06 0.11 0.23 0.33 0.27    0
## C8  0 0.24 0.26 0.32 0.13 0.05    0
## C9  0 0.11 0.19 0.25 0.28 0.18    0
## C10 0 0.03 0.09 0.31 0.35 0.22    0

Getting Descriptives

describe(survey)

Computing Scale Scores

This is done with the purrr package, which is the Tidyverse version of apply. It allows us to iterate over rows,columns,dataframes in a clean and concise manner. You can also do this with rowwise() and mutate, but again Hadley has issues with that.

Extraversion %>% mutate(score = pmap_dbl(.,function(...) mean(c(...))))

Regression

Here i’ll demonstrate a very basic regression. It’s not the complexity of the statistics, but rather showing how to go through every phase of an analysis.

I need to create a function here which I get to when I clean the data.

replace_over_4<- function(x) replace(x,x>4,NA)

Cleaning the data

I do this all in one pipe. Proceed at your own risk.

health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
  dplyr::select(mh1 = H4MH3, mh2 = H4MH4, mh3 = H4MH5, mh4 = H4MH6,
         jailage = H4CJ20, gender = BIO_SEX4) %>%
  mutate_at(vars(mh1:mh4),replace_over_4) %>%
  mutate(mh1 = 4 - mh1,
         mh4 = 4 - mh4) %>%
  rowwise() %>%
  mutate(mh = mean(c(mh1,mh2,mh3,mh4),na.rm=T)) %>%
  ungroup() %>%
  mutate(gender = factor(gender,levels =c(1,2),labels = c("male","female"))) %>%
  filter(jailage < 97)

Visualization

dplyr::select(health_tbl, 5:7) %>% ggpairs()

dplyr::select(health_tbl, -gender) %>% cor(use ="pairwise")

##               mh1         mh2        mh3       mh4     jailage         mh
## mh1     1.0000000  0.18972172 0.33470223 0.4286938  0.07206430 0.74075325
## mh2     0.1897217  1.00000000 0.27916435 0.2129501 -0.03611862 0.57331576
## mh3     0.3347022  0.27916435 1.00000000 0.4123384  0.04372349 0.70121523
## mh4     0.4286938  0.21295010 0.41233840 1.0000000  0.02069340 0.75779718
## jailage 0.0720643 -0.03611862 0.04372349 0.0206934  1.00000000 0.04231706
## mh      0.7407533  0.57331576 0.70121523 0.7577972  0.04231706 1.00000000

Descriptives

dplyr::select(health_tbl,-gender) %>% apa.cor.table(.,filename="./output/Cortable.doc",table.number=1)

## 
## 
## Table 1 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable   M     SD   1           2           3           4          
##   1. mh1     2.47  1.21                                                
##                                                                        
##   2. mh2     2.98  0.97 .19**                                          
##                         [.09, .28]                                     
##                                                                        
##   3. mh3     2.30  0.94 .33**       .28**                              
##                         [.24, .42]  [.18, .37]                         
##                                                                        
##   4. mh4     2.56  1.14 .43**       .21**       .41**                  
##                         [.34, .51]  [.12, .31]  [.33, .49]             
##                                                                        
##   5. jailage 19.16 3.42 .07         -.04        .04         .02        
##                         [-.03, .17] [-.14, .06] [-.06, .14] [-.08, .12]
##                                                                        
##   6. mh      2.58  0.74 .74**       .57**       .70**       .76**      
##                         [.69, .78]  [.50, .64]  [.65, .75]  [.71, .80] 
##                                                                        
##   5          
##              
##              
##              
##              
##              
##              
##              
##              
##              
##              
##              
##              
##              
##              
##   .04        
##   [-.06, .14]
##              
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##

Analysis

Model 1

model_1<- lm(mh ~ jailage + gender, data = health_tbl)

par(mfrow = c(2,2))
plot(model_1)

par(mfrow=c(1,1))

tidy(summary(model_1))

model1_augment<-augment(model_1)
ggplot(model1_augment,aes(x=jailage, y=mh,color=gender)) +
  geom_line(aes(y=.fitted))

Model 2

model2<- lm(mh ~ jailage + gender + jailage:gender, data=health_tbl)
par(mfrow=c(2,2))
plot(model2)

par(mfrow=c(1,1))

tidy(summary(model2))

ggplot(health_tbl, aes(x=jailage, y=mh, color=gender, group=gender)) + 
  geom_smooth(method= "lm",se=F)

Comparing Models

anova(model_1,model2)

Comparing Change in \(R^2\)

summary(model2)$r.squared-summary(model_1)$r.squared

## [1] 0.002434573

Anova

I’m going to do the same thing that I did above except this will be an ANOVA.Yes its a tour of the GLM.

Data Import and Cleaning

health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
  transmute(admin_month = IMONTH4,
            gender = factor(BIO_SEX4,levels = c(1,2),labels= c("male","female")),
            living_mother = factor(H4WP1,levels= c(0,1,8),labels=c("No","Yes","Don't Know")),
            fiw = replace(H4LM29, H4LM29 >=6,NA))

Visualization

ggpairs(health_tbl)

Analysis

options(contrasts = c("contr.sum", "contr.poly"))
linear_model<- lm(fiw ~ admin_month + living_mother + gender, data = health_tbl)
anova_model<- Anova(linear_model,type = 3)
anova_model

Descriptives

kable(anova_model,digits = 3, out="./output/anova_model.htm")

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	3235.235	1	4298.538	0.000
admin_month	3.408	1	4.529	0.033
living_mother	2.441	2	1.622	0.198
gender	41.747	1	55.468	0.000
Residuals	3778.233	5020	NA	NA

Diagnostic Plots

plot(linear_model)

pirateplot(formula = "fiw ~ living_mother + gender",data = health_tbl)

Marginal Means Plot

mm_df<- tidy(lsmeans(linear_model,"gender",by="living_mother"))
ggplot(mm_df,
       aes(x=living_mother,
           y=estimate,
           color=gender,
           group=gender)) +
  geom_line()

Tukeys Post-hoc Tests

health_tbl %<>% mutate(cond=interaction(gender,living_mother,sep=" x "))
posthoc_model<- lm(fiw~admin_month + cond, data = health_tbl)
posthocs <- glht(posthoc_model, linfct=mcp(cond="Tukey"))
summary(posthocs)

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = fiw ~ admin_month + cond, data = health_tbl)
## 
## Linear Hypotheses:
##                                              Estimate Std. Error t value
## female x No - male x No == 0                 -0.32298    0.10918  -2.958
## male x Yes - male x No == 0                   0.02272    0.08075   0.281
## female x Yes - male x No == 0                -0.15289    0.08052  -1.899
## male x Don't Know - male x No == 0            0.09060    0.20497   0.442
## female x Don't Know - male x No == 0         -0.14135    0.19725  -0.717
## male x Yes - female x No == 0                 0.34571    0.07805   4.429
## female x Yes - female x No == 0               0.17009    0.07775   2.188
## male x Don't Know - female x No == 0          0.41359    0.20393   2.028
## female x Don't Know - female x No == 0        0.18163    0.19615   0.926
## female x Yes - male x Yes == 0               -0.17562    0.02541  -6.911
## male x Don't Know - male x Yes == 0           0.06788    0.19023   0.357
## female x Don't Know - male x Yes == 0        -0.16407    0.18186  -0.902
## male x Don't Know - female x Yes == 0         0.24349    0.19011   1.281
## female x Don't Know - female x Yes == 0       0.01154    0.18171   0.064
## female x Don't Know - male x Don't Know == 0 -0.23195    0.26186  -0.886
##                                              Pr(>|t|)    
## female x No - male x No == 0                   0.0269 *  
## male x Yes - male x No == 0                    0.9997    
## female x Yes - male x No == 0                  0.3414    
## male x Don't Know - male x No == 0             0.9970    
## female x Don't Know - male x No == 0           0.9730    
## male x Yes - female x No == 0                  <0.001 ***
## female x Yes - female x No == 0                0.1963    
## male x Don't Know - female x No == 0           0.2703    
## female x Don't Know - female x No == 0         0.9216    
## female x Yes - male x Yes == 0                 <0.001 ***
## male x Don't Know - male x Yes == 0            0.9989    
## female x Don't Know - male x Yes == 0          0.9293    
## male x Don't Know - female x Yes == 0          0.7494    
## female x Don't Know - female x Yes == 0        1.0000    
## female x Don't Know - male x Don't Know == 0   0.9343    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Effect size estimates

etaSquared(linear_model, type=3,anova=T)

##                     eta.sq  eta.sq.part          SS   df         MS         F
## admin_month   0.0008912691 0.0009012830    3.408329    1  3.4083293  4.528522
## living_mother 0.0006383947 0.0006457325    2.441305    2  1.2206523  1.621836
## gender        0.0109167729 0.0109286326   41.747165    1 41.7471646 55.467924
## Residuals     0.9879980344           NA 3778.233464 5020  0.7526361        NA
##                          p
## admin_month   3.338298e-02
## living_mother 1.976392e-01
## gender        1.112443e-13
## Residuals               NA

Publication worthy APA table

Planned Contrasts

linear_model_lsm<- lsmeans(linear_model,"living_mother")
contrast(linear_model_lsm, list(knowledge=c(-.5,-.5,1)))

##  contrast  estimate    SE   df t.ratio p.value
##  knowledge   0.0879 0.134 5020 0.657   0.5111 
## 
## Results are averaged over the levels of: gender

R Crash Course

Justin M. Jones

Checking your R version and Updating

Installing and Loading Packages

Function Troubleshooting

Reading in Data

CSV files

Excel files

Data from a website

Delimited Text Files

SPSS files

Exploring a Dataset

Basic data manipulation

The Tidyverse

Magrittr

Dplyr

Filter

Select

Mutate

Transmute

Purrr

Getting Reliability Estimates

Getting Descriptives

Computing Scale Scores

Regression

Cleaning the data

Visualization

Descriptives

Analysis

Model 1

Model 2

Comparing Models

Comparing Change in \(R^2\)

Anova

Data Import and Cleaning

Visualization

Analysis

Descriptives

Diagnostic Plots

Marginal Means Plot

Tukeys Post-hoc Tests

Effect size estimates

Publication worthy APA table

Planned Contrasts